Probabilistic Programs as Spreadsheet Queries

Authors

  • Andrew D. Gordon
  • Claudio V. Russo
  • Marcin Szymczak
  • Johannes Borgström
  • Nicolas Rolland
  • Thore Graepel
  • Daniel Tarlow

Abstract

We describe the design, semantics, and implementation of a probabilistic programming language where programs are spreadsheet queries. Given an input database consisting of tables held in a spreadsheet, a query constructs a probabilistic model conditioned by the spreadsheet data, and returns an output database determined by inference. This work extends probabilistic programming systems in three novel aspects: (1) embedding in spreadsheets, (2) dependently-typed functions, and (3) typed distinction between random- and query-variables. It empowers users with knowledge of statistical modelling to do inference simply by editing textual annotations within their spreadsheets, with no other coding.

1 Spreadsheets and Typeful Probabilistic Programming

Probabilistic programming systems [14, 17] enable a developer to write a short piece of code that models a dataset, and then to rely on a compiler to produce efficient inference code to learn parameters of the model and to make predictions. Still, a great many of the world's datasets are held in spreadsheets, and accessed by users who are not developers. How can spreadsheet users reap the benefits of probabilistic programming systems?

Our first motivation here is to describe an answer, based on an overhaul of Tabular [16], a probabilistic language based on annotating the schema of a relational database. The original Tabular is a standalone application that runs fixed queries on a relational database (Microsoft Access). We began the present work by re-implementing Tabular within Microsoft Excel, with the data and program held in spreadsheets.

The conventional view is that the purpose of a probabilistic program is to define the random-variables whose marginals are to be determined (as in the query-by-missing-value of original Tabular).
In our experience with spreadsheets, we initially took this view, and relied on Excel formulas, separate from the probabilistic program, for post-processing tasks such as computing the mode (most likely value) of a distribution, or deciding on an action (whether or not to place a bet, say). We found, to our surprise, that combining Tabular models and Excel formulas is error-prone and cumbersome, particularly when the sizes of tables change, the parameters of the model change, or we simply need to update a formula for every row of a column.

In response, our new design contributes the principle that a probabilistic program defines a pseudo-deterministic query on data. The query is specified in terms of three sorts of variable: (1) deterministic variables holding concrete input data; (2) nondeterministic random-variables constituting the probabilistic model conditioned on input data; and (3) pseudo-deterministic query-variables defining the result of the program (instead of using Excel formulas). Random-variables are defined by draws from a set of built-in distributions. Query-variables are defined via an infer primitive that returns the marginal posterior distributions of random-variables. For instance, given a random-variable of Boolean type, infer returns the probability p that the variable is true. In theory, infer is deterministic: it has an exact semantics in terms of measure theory. In practice, infer (and hence the whole query) is only pseudo-deterministic, as implementations almost always perform approximate or nondeterministic inference. We have many queries as evidence that post-processing can be incorporated into the language.

Our second motivation is to make a case for typeful probabilistic programming in general, with evidence from our experience of overhauling Tabular for spreadsheets. Cardelli [7] identifies the programming style based on widespread use of mechanically-checked types as typeful programming.
Probabilistic languages that are embedded DSLs, such as HANSEI [19], Fun [3], and Factorie [22], are already typeful in that they inherit types from their host languages, while standalone languages, such as BUGS [11] or Stan [35], have value-indexed data schemas (but no user-defined functions). Still, we find that more sophisticated forms of type are useful in probabilistic modelling. We make two general contributions to typeful probabilistic programming.

(1) Value-indexed function types usefully organise user-defined components, such as conjugate pairs, in probabilistic programming languages. We allow value indexes in types to indicate the sizes of integer ranges and of array dimensions. We add value-indexed function types for user-defined functions, with a grid-based syntax. The paper has examples of user-defined functions (such as Action in Section 6) showing their utility beyond the fixed repertoire of conjugate pairs in the original Tabular. An important difficulty is to find a syntax for functions and their types that fits with the grid-based paradigm of spreadsheets.

(2) A type-based information-flow analysis usefully distinguishes the stochastic and deterministic parts of a probabilistic program. To track the three sorts of variable, each type belongs to a space indicating whether it is: (det) deterministic input data, (rnd) a non-deterministic random-variable defining the probabilistic model of the data, or (qry) a pseudo-deterministic query-variable defining a program result. Spaces allow a single language to define both model and query, while the type system governs flows between the spaces: data flows from rnd to qry via infer, but to ensure that a query needs only a single run of probabilistic inference, there are no flows from qry to rnd.
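The flow discipline between spaces can be pictured with a small Python sketch. This is only an illustration, not the paper's type system: the names Space and can_flow are hypothetical, and the assumption that det data may be used freely in both rnd and qry is ours (the text above states only that the model is conditioned on input data, that rnd reaches qry solely through infer, and that qry never flows to rnd).

```python
from enum import Enum


class Space(Enum):
    DET = "det"  # deterministic input data
    RND = "rnd"  # random-variables of the probabilistic model
    QRY = "qry"  # query-variables defining the program result


# Direct flows assumed to be permitted without infer (our assumption):
# det data may be read anywhere; rnd and qry stay within their own space.
_DIRECT_FLOWS = {
    (Space.DET, Space.DET),
    (Space.DET, Space.RND),
    (Space.DET, Space.QRY),
    (Space.RND, Space.RND),
    (Space.QRY, Space.QRY),
}


def can_flow(src: Space, dst: Space, via_infer: bool = False) -> bool:
    """Return True if a value in space src may be used in space dst."""
    if via_infer:
        # infer is the only bridge from the model to the result query.
        return src is Space.RND and dst is Space.QRY
    return (src, dst) in _DIRECT_FLOWS
```

Under these assumptions, can_flow(Space.RND, Space.QRY, via_infer=True) holds, while can_flow(Space.QRY, Space.RND) is rejected whether or not infer is involved, which is what keeps a query down to a single run of inference.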
There is an analogy between our spaces and levels in information flow systems: det-space is like a level of trusted data; rnd-space is like a level of untrusted data that is tainted by randomness; and qry is like a level of trusted data that includes untrusted data explicitly endorsed by infer. The benefits of spaces include: (1) to document the role of variables, (2) to slice a program into the probabilistic model versus the result query, and (3) to prevent accidental errors. For instance, only variables in det-space may appear as indexes in types, which guarantees that our models can be compiled to the finite factor-graphs supported by inference backends such as Infer.NET [23].

This paper defines the syntax, semantics, and implementation of a new, more typeful Tabular. Our implementation is a downloadable add-in for Excel. For execution on data in a spreadsheet, a Tabular program is sliced into (1) an Infer.NET model for inference, and (2) a C# program to compute the results to be returned to the spreadsheet.
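The split between model and result query can be imitated in a few lines of Python. The sketch below is not Tabular syntax and does not use Infer.NET; it is a toy, with hypothetical names (flips, infer_biased, place_bet) and illustrative numbers, that computes the posterior exactly by Bayes' rule instead of running approximate inference. It shows deterministic input data, a Boolean random-variable, an infer-like step that returns the probability p that the variable is true, and a post-processing step that turns p into an action.

```python
from fractions import Fraction

# det-space: concrete input data (hypothetical coin flips, True = heads).
flips = [True, True, False, True]

# rnd-space: a Boolean random-variable "biased" with a uniform prior.
# Illustrative assumption: heads has probability 3/4 if biased, else 1/2.
PRIOR_BIASED = Fraction(1, 2)
P_HEADS = {True: Fraction(3, 4), False: Fraction(1, 2)}


def likelihood(biased: bool, data: list) -> Fraction:
    """Probability of the observed flips given the value of biased."""
    result = Fraction(1)
    for heads in data:
        p = P_HEADS[biased]
        result *= p if heads else 1 - p
    return result


def infer_biased(data: list) -> Fraction:
    """Exact posterior probability that biased is true (Bayes' rule)."""
    joint_true = PRIOR_BIASED * likelihood(True, data)
    joint_false = (1 - PRIOR_BIASED) * likelihood(False, data)
    return joint_true / (joint_true + joint_false)


# qry-space: results computed from the inferred marginal, with no flow
# back into the model.
p_biased = infer_biased(flips)
place_bet = p_biased > Fraction(1, 2)
```

With the four flips above, the two likelihoods are 27/256 and 16/256, so p_biased is 27/43 and place_bet is True. Because the toy enumerates both hypotheses exactly, it is fully deterministic; a real backend would typically return an approximation of p, which is why the paper calls such queries pseudo-deterministic.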

Related resources

Spreadsheet Probabilistic Programming

Spreadsheet workbook contents are simple programs. Because of this, probabilistic programming techniques can be used to perform Bayesian inversion of spreadsheet computations. What is more, existing execution engines in spreadsheet applications such as Microsoft Excel can be made to do this using only built-in functionality. We demonstrate this by developing a native Excel implementation of bot...

Data Debugging (Full Presentation)

Testing and static analysis can help root out bugs in programs, but not in data. We introduce data debugging, an approach that combines program analysis and statistical analysis to find potential data errors. Since it is impossible to know a priori whether data are erroneous or not, data debugging locates data that has an unusual impact on the computation. Such data is either very important, or...

XLSearch: A Search Engine for Spreadsheets

Spreadsheets are end-user programs and domain models that are heavily employed in administration, financial forecasting, education, and science because of their intuitive, flexible, and direct approach to computation. As a result, institutions are swamped by millions of spreadsheets that are becoming increasingly difficult to manage, access, and control. This note presents the XLSearch system, ...

On the Efficient Execution of ProbLog Programs

The past few years have seen a surge of interest in the field of probabilistic logic learning or statistical relational learning. In this endeavor, many probabilistic logics have been developed. ProbLog is a recent probabilistic extension of Prolog motivated by the mining of large biological networks. In ProbLog, facts can be labeled with mutually independent probabilities that they belong to a...

Views on UML Interactions as Spreadsheet Queries

This paper explores the use of table-based representation for artifacts occurring in model-driven development as opposed to graph-based representation. As an example for table-based representation of models, we explain how views on object interaction that are traditionally represented as UML sequence or communication diagrams can be realized by spreadsheet queries.

Publication date: 2015